Using Thesauri for Automatic Indexing and for the Visualisation of Multilingual Document Collections
Abstract
This article presents an approach for cross-language document comparison and for the visualisation of multilingual document collections. Document comparison usually relies on calculating the degree of lexical overlap between documents. As this is not possible for documents written in different languages, the contents of these documents first have to be mapped onto a language-independent representation. The JRC’s statistical tool for controlled-vocabulary keyword assignment assigns descriptors from the multilingual Eurovoc thesaurus, which can be used for cross-language document comparison. These language-independent sets of thesaurus descriptors make it possible to identify, for a given document, the most similar documents, even if they are written in different languages. They also make it possible to organise and visualise the structure and approximate contents of whole multilingual document collections in two-dimensional document maps.
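The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes each document has already been reduced to a set of weighted thesaurus descriptors, and it uses cosine similarity as the comparison measure; the abstract does not name the measure, so that choice is an assumption. The descriptor identifiers (`D1`, `D2`, `D3`), the weights, and the function names are hypothetical and are not taken from Eurovoc or from the JRC tool.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse descriptor-weight vectors."""
    dot = sum(w * b[d] for d, w in a.items() if d in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, collection, top_n=3):
    """Rank the other documents by similarity to the query document."""
    query = collection[query_id]
    scored = [
        (other_id, cosine(query, vec))
        for other_id, vec in collection.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical documents in three languages, each mapped onto
# language-independent descriptor identifiers with assumed weights.
collection = {
    "doc_en": {"D1": 0.9, "D2": 0.4},
    "doc_de": {"D1": 0.8, "D2": 0.5},
    "doc_fr": {"D3": 1.0},
}

ranking = most_similar("doc_en", collection)
```

Because the vectors are keyed by descriptor identifier rather than by words, the English and German documents in this toy collection rank as similar even though they share no vocabulary.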
Similar resources
Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation
Purpose: This study investigates automatic keyword extraction from the tables of contents of Persian e-books in the field of science using LDA topic modeling, and evaluates the similarity of the extracted keywords to a gold standard as well as users' views of the model keywords. Methodology: This is a mixed text-mining study in which LDA topic modeling is used to extract keywords from the table of contents of sci...
Automatic annotation of multilingual text collections with a conceptual thesaurus
Automatic annotation of documents with controlled vocabulary terms (descriptors) from a conceptual thesaurus is not only useful for document indexing and retrieval. Mapping texts onto the same thesaurus also makes it possible to establish links between similar documents. This is also a substantial requirement of the Semantic Web. This paper presents an almost language-independent system that...
Automatic Multi-label Subject Indexing in a Multilingual Environment
This paper presents an approach to automatic subject indexing of full-text documents with multiple labels, based on binary support vector machines (SVM). The aim was to test the applicability of SVMs with a real-world dataset. We have also explored the feasibility of incorporating multilingual background knowledge, as represented in thesauri or ontologies, into our text document representation for ...
Semantic Indexing of Multilingual Corpora and its Application on the History Domain
The increasing number of multilingual text collections available in different domains makes their automatic processing essential for the development of a given field. However, standard processing techniques based on statistical clues and keyword searches have clear limitations. Instead, we propose a knowledge-based processing pipeline which overcomes most of the limitations of these techniques. T...
Automatic Multilingual Indexing and Natural Language Processing
The number of documents being collected by information brokers such as bibliographic database producers, libraries and publishers is increasing rapidly. The consequence is a huge demand for indexing and classification. So far this has had to be carried out manually. The system AUTINDEX, which is described in this paper, offers tools for monolingual as well as for multilingual automatic indexing and ...
Publication date: 2000